Skip to content

Conversation

@FlorentRevest
Copy link
Collaborator

This is a draft at the moment because I have a few unresolved questions such as:

  • whether this is the right direction architecturally speaking
  • how to express the dependency to fsck commands in syz-manager and the fsck_test (right now, if an fsck command is unavailable it will log a warning and move on)
  • more extensive test coverage (I only tried an ext2 and ext4 image but it could be nice to add more fs there)

@FlorentRevest FlorentRevest marked this pull request as draft November 18, 2024 19:18
@dvyukov
Copy link
Collaborator

dvyukov commented Nov 19, 2024

Architecturally this looks good to me. cc @a-nogikh

We need to figure out what to do with error handling. Nobody is reading manager logs. It can log on error level, then these surface in out logs alerting, but if it happens repeatedly, it may be a problem. Is there any rate-limiting in our logs alerting?

Also now if I see "mount_in_repro", I don't really know if it means non-corrupted, or fsck failed. Do we want "mount_in_repro_non_corrupted"?

Maybe we could also append any errors to the log file. If there is a failure with fsck, we will always want to know more (why /how it failed).

@a-nogikh
Copy link
Collaborator

a-nogikh commented Nov 19, 2024

Regarding how we could store this information in the DB and how we could check for corrupted images in the reporting rules (the question that popped up in offline discussions).

I'd vote for not splitting the dashapi.MountInRepro type into corrupted and non-corrupted images. Maybe we could add more flags/attributes into the Asset entity? E.g. whether it's corrupted, what's the fs type.

type Asset struct {
Type dashapi.AssetType
DownloadURL string
CreateDate time.Time
}

If we want to store fsck output logs, that could also go there as a text reference like in other places.

Log int64 // reference to CrashLog text entity

I think that should be quite straightforward implementation-wise.


Checking for corrupted images in the Reporting filter is unfortunately much more tricky.

// Filter can be used to conditionally skip this reporting or hold off reporting.
Filter ReportingFilter `json:"-"`

ReportingFilter func(bug *Bug) FilterResult

First one important question. There can be multiple crashes per bug, some of which may include corrupted images, some may not. Sometimes there are also some unrelated syz_mount_image calls left, which could also be either corrupted or not.

What kinds of report filtering do we want to support w.r.t. fsck results?

  1. Report if at least one of the reproducers mounts any non-corrupted fs image?
  2. Only look at the images of the specific filesystem type? Do we actually know what fs type to look at?
  3. Report only if there's at least one reproducer that mounts only non-corrupted images?

The answer to this question may affect the design considerations.

If we go with 3, we might also want to recalculate the crash priority depending on whether the associated reproducer mounts any corrupted fs images. See this method:

func (crash *Crash) UpdateReportingPriority(c context.Context, build *Build, bug *Bug) {


It looks like we might have to transform the Reporting.Filter() callback to accept (*Bug, *Crash).

In the DB, we always reference the exact crash used in the specific reporting stage:

CrashID int64 // crash that we've last reported in this reporting

So whenever we have to look at the previous stages, e.g. in the case below, we could pre-fetch all referenced crashes:

switch reporting.Filter(bug) {

There's a helper that makes it simpler:

// dependencyLoader encapsulates the repetitive logic of mass loading referenced entities.
type dependencyLoader[T any] struct {

In the case when we decide whether to report or not and don't yet have CrashID set, I think we could first findCrashForBug and pass the candidate to needReport. That will likely make it much slower (unless we invent some optimizations), but it should work.

reporting, bugReporting, _, _, _, err := needReport(c, typ, state, bug)
if err != nil || reporting == nil {
return nil, err
}
crash, crashKey, err := findCrashForBug(c, bug)
if err != nil {
return nil, err

@FlorentRevest FlorentRevest force-pushed the fsck branch 3 times, most recently from b78995a to 93e9e62 Compare November 28, 2024 11:48
FlorentRevest added a commit to FlorentRevest/syzkaller that referenced this pull request Nov 28, 2024
As part of google#5518, I'm adding fsck logs as annotation to the mounted
file system assets. For this, I need a variety of fsck-like commands in
the ci environment as well as eventually in the production environment.
FlorentRevest added a commit to FlorentRevest/syzkaller that referenced this pull request Nov 28, 2024
As part of google#5518, I'm adding fsck logs as annotation to the mounted
file system assets. For this, I need a variety of fsck-like commands in
the ci environment as well as eventually in the production environment.
@FlorentRevest
Copy link
Collaborator Author

Could you have another look ? I think we are now in a much better state in terms of logic to run fsck (I extended the syscall descriptions of file system mounts and expanded my tests range to 16 different file systems). But I'm still a bit unsure about the database storage story (I used a uint64 log entity but also piggy-backed on the assets list to render in the UI so it's a bit ugly that I had to introduce an asset type for something that is not actually uploaded like the other assets). Also I haven't explored much of the testing story yet since it's all over the place, let me know if you have any tests in mind you'd like to see.

The tests Should (TM) pass once #5549 is submitted and the env docker image published.

FlorentRevest added a commit to FlorentRevest/syzkaller that referenced this pull request Nov 28, 2024
As part of google#5518, I'm adding fsck logs as annotation to the mounted
file system assets. For this, I need a variety of fsck-like commands in
the ci environment as well as eventually in the production environment.
@FlorentRevest FlorentRevest force-pushed the fsck branch 3 times, most recently from 9120da8 to feda84b Compare November 29, 2024 10:46
github-merge-queue bot pushed a commit that referenced this pull request Nov 29, 2024
As part of #5518, I'm adding fsck logs as annotation to the mounted
file system assets. For this, I need a variety of fsck-like commands in
the ci environment as well as eventually in the production environment.
Copy link
Collaborator

@a-nogikh a-nogikh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
I've commented on the logic that is to be tested and what still needs to be adjusted.

@FlorentRevest FlorentRevest force-pushed the fsck branch 2 times, most recently from d2a9001 to 6d08674 Compare December 4, 2024 20:51
a-nogikh
a-nogikh previously approved these changes Dec 5, 2024
Syscall attributes are extended with a fsck command field which lets
file system mount definitions specify a fsck-like command to run. This
is required because all file systems have a custom fsck command
invokation style.

When uploading a compressed image asset to the dashboard, syz-manager
also runs the fsck command and logs its output over the dashapi.

The dashboard logs these fsck logs into the database.

This has been requested by fs maintainer Ted Tso who would like to
quickly understand whether a filesystem is corrupted or not before
looking at a reproducer in more details. Ultimately, this could be used
as an early triage sign to determine whether a bug is obviously
critical.
@a-nogikh a-nogikh added this pull request to the merge queue Dec 9, 2024
Merged via the queue into google:master with commit deb7287 Dec 9, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants